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For many optimization algorithms the time-to-solution depends not only on the problem size 
but also on the specific problem instance and may vary by many orders of magnitude. It is then 
necessary to investigate the full distribution and especially Its tail. Here we analyze the distributions 
of annealing times for simulated annealing and simulated quantum annealing (by path integral 
quantum Monte Carlo) for random Ising spin glass instances. We find power-law distributions with 
very heavy tails, corresponding to extremely hard instances, but far broader distributions - and 
thus worse performance for hard instances - for simulated quantum annealing than for simulated 
annealing. Fast, non-adiabatic, annealing schedules can improve the performance of simulated 
quantum annealing for very hard instances by many orders of magnitude. 


Non-convex optimization problems arise in a wide 
range of areas, from industrial applications to scientific 
research. Many challenging optimization problems be¬ 
long to the class of non-deterministic polynomial-time 
hard (NP-hard) problems. For these problems, which 
includes the famous traveling salesman problem, no ef¬ 
ficient algorithm is known that scales polynomially with 
problem size. In the absence of efficient exact solvers, 
often heuristic algorithms are the best choice. For many 
algorithms, and especially probabilistic annealing strate¬ 
gies [1-5], the time to find a (near)-optimal solution may 
depend not just on the problem size N, but may vary 
greatly - by many orders of magnitude - from instance 
to instance. In such cases one usually quotes the typical 
performance given by the median time-to-solution. 

Despite its common use, the median time-to-solution is 
often not the quantity that dictates efficiency, since one 
generally wants to solve more than half of the problem 
instances. In particular, in the case of broad distribu¬ 
tions of the time-to-solutions, higher quantiles (such as 
the 99-th percentile) may be better indicators of the per¬ 
formance of an optimization algorithm. In this Letter 
we demonstrate that when analyzing the performance of 
optimization algorithms, it is very important to consider 
the tails of the time-to-solution distribution and not only 
averages or median performances. 

As a specific example we consider the problem of find¬ 
ing the ground states of Ising spin glasses on chimera 
graphs. In these models N Ising spins Si that can take 
values ±1 should be chosen to minimize the total energy, 

Hp =—JijSiSj, ( 1 ) 

ihj) 

where the coupling constants Jij take random values on 
the edges of the so-called Chimera graph. For more infor¬ 
mation see the Supplementary Material. This quasi-two- 
dimensional graph with eight spins per unit cell is non- 
planar, which makes the problem of finding the ground 
state of the Ising spin glass NP-hard [7]. 

The Ising spin glass problem on the chimera graph has 
recently received special attention, since this graph is im¬ 


plemented in the optimization devices built by the Cana¬ 
dian company D-Wave systems [8], whose performance 
has been controversially discussed in the recent litera¬ 
ture [6, 9-13]. We summarize, in Fig. 1 previous results 
]6] of the scaling of the time-to-solution (TTS) as a func¬ 
tion of problem size N for three different approaches: a 
simulated annealer (SA), a simulated quantum annealer 
(SQA) using path integral quantum Monte Carlo, and a 
D-Wave Two device (DW2). Note one striking feature of 
these results: there is a huge spread in TTS between easy 
and hard instances (the 99th percentile), which varies by 
three orders of magnitude for SA, and by even more for 
SQA and DW2. The hard instances completely domi¬ 
nate the TTS. Below we will quantitatively analyze the 
distribution of TTS for the hardest instances, and show 
how it can be substantially reduced for SQA by using 
non-adiabatic annealing schedules. 

Simulated annealing (SA) is inspired by annealing, a 
process in metallurgy by which a material is heated up 
and then slowly cooled down in order to relieve internal 
stresses. In 1983, Kirkpatrick and coworkers ]1] showed 
that simulating an annealing process a Monte Carlo sim¬ 
ulation can be used as a general purpose heuristic solver 
for optimization problems, which is widely used in many 
application areas ]2]. By slowly decreasing the temper¬ 
ature the system can relax into a low-energy state, es¬ 
caping local minima by thermal activation over barriers. 
Implementation details are discussed in the Supplemen¬ 
tary Material. 

Inspired by the phenomenon of quantum tunneling, 
quantum annealing (QA) uses quantum fluctuations in¬ 
stead of thermal fluctuations to explore the configuration 
space ]3, 14-17]. It employs a time-dependent Hamilto¬ 
nian, 

H{t)= A{t)HD+B{t)Hp, (2) 

where Hp = is & driver Hamiltonian implement¬ 

ing quantum dynamics, and the problem Hamiltonian 
Hp = implements the Ising spin glass 

Eq. (1) using quantum spin-1/2 particles, and are 
Pauli matrices. At the beginning of the annealing sched¬ 
ule when A(0) = 1 and B{Q) = 0 we start with only 
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FIG. 1. Scaling of the annealing time to find the ground 
state with 99% probability for a bimodal Ising spin glass with 
couplings ±1 on chimera graphs with N spins using (a) a 
simulated annealer (SA), (b) a simulated quantum annealer 
(SQA) and (c) a D-Wave Two device (DW2). Results (with 
standard errors) are taken from Ref. [6]. We observe an enor¬ 
mous spread in the scaling of different percentiles, ranging 
from easy (1%) to hard (99%) problem instances. 


Ho. The spins are initialized in the ground state align¬ 
ing along the x-axis. During the annealing process we 
decrease A{t) and increase B{t), so that at the end of 
the annealing at time t = ta, yfe end up with the Ising 
spin glass of Eq. (1). The Hamiltonian is changed slowly 
enough such that the system stays in (or close to) the 
ground state at all times. This process takes place at 
a constant finite temperature T and therefore constant 
inverse temperature /3 = l/{kBT), where ko is the Boltz¬ 
mann constant. The D-Wave devices have been designed 
to implement quantum annealing using superconducting 
flux qubits [8] in a physical device. 

Quantum annealing can also be implemented as simu¬ 
lated quantum annealing (SQA) on a classical computer 
using a path integral quantum Monte Carlo simulation 
[4, 5, 18]. Furthermore, by replacing the quantum spins 
in Eq. (2) with two-dimensional classical rotor magnets 
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FIG. 2. Distribution of mean number of repetitions r required 
to find the ground state for 20000 problems with 200 spins on 
Ghimera topology for SA and SQA(with /? = 10). The tails 
of these distribution functions are decaying polynomially slow 
and there is a qualitative difference between SA and SQA: 
For SA the tail is decaying faster and the overall distribution 
function is more compact. 


a mean-field version (MFA) [10] can be obtained as a 
semi-classical version of SQA ]19]. Details of the SQA 
and MFA algorithms and their implementations are pre¬ 
sented in the Supplementary Material. Note that an¬ 
nealing times ta for SA, SQA and MFA are measured in 
numbers of sweeps, where one sweep is defined as one 
attempted update for each spin. 

For each problem size N we created 20000 different 
random problem instances with couplings Jij = ±1. For 
each instance, after determining the exact ground state 
energy using an exhaustive search algorithm we repeat¬ 
edly performed SA, SQA and MFA simulations until we 
found a ground state at least 100 times for each algo¬ 
rithm. For SQA with N = 288 spins and annealing time 
ta = 10^ sweeps we never found the ground state of the 
hardest instances despite billions of attempts. We thus 
focus on the case of N = 200 spins for our quantitative 
analysis, where all algorithms found the ground states of 
all problem instances with reasonable effort. From the 
number of repetitions required to find the ground state 
one hundred times we then calculated the single-run suc¬ 
cess probability s and the mean number of repetitions 
T = 1/s required to find the ground state. 

Plotting histograms of r for 20000 instances with N = 
200 we arrive at the distributions shown in Fig. 2. We 
see very slowly decaying power-law tails, consistent with 
the wide spread of quantiles in Fig. 1. We also see that 
while the median times are very similar (t = 2.1 for SA 
and r = 2.2 for SQA), there is a substantial quantitative 
difference between the distributions. For SA tails extend 
up to large values of r « 100, but for SQA to values 
of T « 10^ which is more than 5 orders of magnitude 
larger. The median times are thus neither indicative of 
the time needed to solve hard problem instances, nor of 
the relative performance of the two algorithms! 

To analyze this significant difference in the tail dis¬ 
tribution function in detail, we use extreme value the- 
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FIG. 3. Generalized Pareto probability density function 
w^,o,o-(2;) = -^W^fi^cr{x) plotted on a linear-log scale. The 
qualitative tail behavior depends on the shape parameter 
For ^ < 0 the probability density is bounded, for ^ = 0 it de¬ 
cays exponentially and for ^ > 0 the tail is heavy and decays 
as a power-law. 

ory (EVT) to obtain a good estimate for the tail dis¬ 
tribution of T given only a limited data set and specifi¬ 
cally to calculate the so-called shape parameter which 
gives information about its power-law tail. Let us de¬ 
note by F{x) = Pr{Al < a;} the unknown distribution 
function of a random variable X, e.g. the distribution 
of r for a specific algorithm. If F satisfies certain con¬ 
ditions (described in the Supplementary Material) then 
the Balkema-de Haan-Pickands (BdHP) theorem [20, 21] 
states that the tail of the distribution function, F{x) for x 
larger than a high threshold u, is approximated by a gen¬ 
eralized Pareto (GP) distribution function (ic), 

i.e. 

F{x) « W 5 ,u„,a„(a;) for x > u. (3) 

The GP distribution function with threshold/location pa¬ 
rameter u, shape parameter scale parameter tT„ > 0 is 
given by 

+ ( 4 ) 

which is defined on {x : x > u and (1 -I- ^(x — u)lau) > 0} 
[22-24]. For ^ = 0 one takes the limit ^ t 0, 

bTo.«,o-„(a:) = 1 - exp (-(x - u)/cr„), x>u. (5) 

The qualitative behavior of the GP distribution func¬ 
tion is given by its shape parameter which turns out to 
be very useful when benchmarking different algorithms. 
The shape parameter is independent of the threshold u 
and of any linear transformations on A", e.g. considering 
the number of spin flips to find the ground state instead 
of T. The probability density has three distinct tail be¬ 
haviors, shown in Fig. 3, depending on the sign of It 
is bounded for ^ < 0, exponentially decaying for ^ = 0 
and slowly decaying as a power-law for ^ > 0, which we 
also refer to as having a heavy tail. Note that the A:-th 
moment of W{,«,cr„(x) diverges for fc > 1/^ [25] . 


T r 

FIG. 4. Distribution of mean number of repetitions r required 
to find the ground state for 20000 problem instances with 
N = 200 spins for (a) SQA with = 10 and (b) SA with either 
the optimal number or 10'* sweeps. While the distribution 
changes only slightly for SA, surprisingly the tails in SQA 
decrease much more rapidly when annealing faster. 


If a heuristic algorithm has ^ > 0, then hard instances 
dominate the average time-to-solution. If the shape pa¬ 
rameter 5 > 1 the average run-time of an instance is 
infinite. In our case of discrete couplings ±1, there is 
only a finite set of instances for each N. The conditions 
of the the BdHP theorem are thus not satisfied because 
the distribution of t is discrete with only a finite num¬ 
ber of points in its support [26, p. 217]. Nevertheless 
we can still use the GP distribution to analyze our data, 
and obtain excellent fits (see Supplementary Material). 
The main consequence of a finite set of instances is that 
F{t) differs from the GP distribution function for the 
hardest instances, and thus for example the mean r is 
not infinite but dominated by the hardest instances. In 
the Supplementary Material we show that the mean is 
not converging for ^ > 1 using 20000 problem instances. 

We start, by investigating the effects of annealing time 
ta on the distribution. In Fig. 4 we compare the distri¬ 
butions for ta = 10'* sweeps (which for SQA with /3 = 10 
was found to give good correlations with a D-Wave One 
device [9]) to simulations run with the (size-dependent) 
optimal that minimizes the total effort taX (see Ref. 
6 and the Supplementary Material). The surprising re¬ 
sult is that performing SQA with a fast schedule of only 
= 150 sweeps leads to a more compact distribution 
with much faster decaying tails. While easy instances 
may need a few more repetitions, hard instances need far 
less repetitions to find the ground state. 

As the correlation plot in Fig. 5 shows, the relative 
hardness of an instance doesn’t change significantly when 
changing ta, but the single-run success probability s of 
SQA increases for more than 10% of the instances when 
SQA is run faster. This is opposite to SA where for al¬ 
most all instances s decreases when annealing faster. In 
both cases, however, the total effort rta decreases for al¬ 
most all instances. Even though more repetitions t are 
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FIG. 5. Correlation plot of (a) SQA with /3 = 10 and (b) 
SA for all 20000 instances with N = 200 spins using and 
ta = 10"" sweeps respectively. Points below the white line 
(t,t) are those where the success probability s increases upon 
annealing faster. This is the case for 2312 instances for SQA 
but only 11 for SA. This increase in s for the hard instances 
explains the more rapid decrease of the tail in SQA. The grey 
line indicates points with equal total effort taT. Instances 
below the grey line (all but one for SQA and all but 51 for 
SA) are those which get solved more efficiently with a faster 
annealing time. 


needed, this is more than compensated by the faster an¬ 
nealing time. For SQA, the combined effect of needing 
688 times fewer repetitions and performing only 150 in¬ 
stead of 10^ sweeps per repetition reduces the total effort 
for the hardest of our instances by a factor 45866! 

To quantify the difference in the tail distribution func¬ 
tions we fit them to GP distribution functions and plot 
the size-dependence of the shape parameter ^ for SA, 
SQA and MFA in Fig. 6. For SQA we use the two sched¬ 
ules discussed above and additionally a third schedule, 
with sweeps but a higher temperature /3 = 4, 

which is the optimal temperature for MFA (see Sup¬ 
plementary Material). All algorithms have heavy tails, 
^ > 0, for problems with size N > 72, and ^ is in¬ 
creasing with system size. For SQA using /3 = 10 and 
ta = 10^ we find ^ > 1 already for N > 72, indicating 
that the mean time-to-solution is already divergent and 
dominated by the hardest instances. As we already saw 
above for N = 200, the tail behavior of SQA improves 



Problem size y/N 

FIG. 6. Shown are the fitted shape parameters ^ with stan¬ 
dard errors for SA, MFA and SQA. The different versions of 
SQA have different inverse temperature /3 and ta, with (N) 
varying from 20 to 400 sweeps depending on N and jS. For 
^ > 0 (indicated by a dashed line) the tail of the distribution 
function is heavy and for ^ > 1 (indicated by a dotted line) 
the mean diverges. The fcth moment diverges for k>l/^. 


when using faster annealing schedules, and further small 
improvements are obtained by raising the temperature. 
While SQA with the best setting and MFA have the same 

SA has smaller indicating a faster decaying tail and 
better performance on hard instances. 

Our results show that the tail behavior is the impor¬ 
tant measure of performance when considering the per¬ 
formance of an optimization algorithm since the hard¬ 
ness of an instance is a priori unknown. Therefore one 
has to choose a large enough run-time such that a high 
percentile of random instances are solved. Considering 
just the typical (median) performance of an optimiza¬ 
tion algorithm can be misleading in the presence of broad 
heavy-tailed distributions. The median effort misses the 
important fact that already at moderate problem sizes 
the mean time-to-solution can diverge, indicating that 
the hardest instances dominate the average. As one typ¬ 
ically wants to solve much more than half of the prob¬ 
lem instances, reducing the tails of the distribution is 
crucial to optimize the performance of an optimization 
algorithm. 

One surprising result we obtained here is that for sim¬ 
ulated quantum annealing, fast non-adiabatic schedules 
show far superior performance than slow adiabatic an¬ 
nealing schedules that would naively seem superior. This 
observation is consistent with results obtained in Refs. 
[18, 27] which showed that for some instances fast anneal¬ 
ing schedules can help a (simulated) quantum annealer 
escape local minima. As good agreement was found be¬ 
tween SQA and the D-Wave devices, we expect that simi¬ 
larly faster annealing schedules can help the performance 
not only of simulated quantum annealers but also of phys¬ 
ical quantum annealing devices. 

The tail behavior is also important for extrapolating 
the performance of annealers to larger sizes. The tails 
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are indicative of harder problems that will dominate for 
larger problem sizes. We thus expect scaling differences 
between different annealers to first show up in the tails. 
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I. ANNEALING ALGORITHMS 

In the following we present implementation details for 
the three annealing algorithms. 


A. Simulated annealing 

SA is a Monte Carlo algorithm which generates states 
of a classical system in thermal equilibrium whose energy 
is described by the following Hamiltonian 

Hp = ( 6 ) 

{i,3) 

Our SA code generates new states by flipping a single 
spin and accepting the spin flip according to the Metropo¬ 
lis update rule [28]. Hence, the probability to generate 
a state with energy E is proportional to the Boltzmann 
weight exp{—f]E), where /3 = iKkpT) is the inverse tem¬ 
perature. Over the annealing time which is measured 
in numbers of sweeps (one sweep is an attempted flip 
of every spin in the system), the temperature is slowly 
decreased. In the beginning, when the temperature is 
high, SA has a large probability to generate high energy 
states, called thermal excitations, which allow SA to es¬ 
cape from a local minimum. In the end, the temperature 
is low such that mostly states with lower energy are ac¬ 
cepted and the system relaxes into the local minimum. 
For these instances with = ±1, we linearly increase 
the inverse temperature /3(t) = 0.1 -I- 2.9 x t/ta over the 
annealing time ta- In order to reduce CPU time, our im¬ 
plementation uses forward computation of the energies 
as described in Ref. [29]. 

B. Simulated quantum annealing 

SQA is a Monte Carlo algorithm which generates 
states of a quantum system in thermal equilibrium. 
During the annealing the temperature is kept constant. 
The Hamiltonian of the system is time dependent and 
given by Eq. (2) in the main paper. At the beginning 
of the annealing schedule when A(0) = 1 and 5(0) = 0 
we start with only EIp. We use a linear schedule, 
i.e. we linearly decrease A{t) = 1 — t/ta and increase 
B{t) = t/ta, so that at the end of the annealing at 
time t = ta, the Hamiltonian of the system is given by 
Elp. SQA uses the Suzuki-Trotter formalism with which 
a d-dimensional Ising model with a transverse held is 
mapped to a {d + l)-dimensional Ising model at finite 
temperature [30] . Here we use a discrete-time path 
integral formulation with 32 imaginary time slices and 
we build clusters along the imaginary time direction. 


On a technical level, we exploit a random bit algorithm 
explained in Ref. [31] to form new domain walls in all 
32 Trotter slices simultaneously. 

We analyze SQA as simulation of a physical system, 
which means that in the end of annealing, the success 
probability of finding the ground state is given by the 
number of Trotter slices which reached the ground state 
divided by 32 (the total number of Trotter slices). This 
means for example that if half the Trotter slices are in 
the ground state, we count this run as having found the 
ground state only with probability 0.5, because an actual 
quantum system would collapse into one of the Trotter 
slices. One could on contrary also analyze SQA as a 
purely classical algorithm and define that a single run 
has found the ground state energy with probability 1 if 
at least one Trotter slices is in the ground state, as one 
can check all Trotter slices in a classical code. This way 
of analyzing SQA resulted in very similar shape parame¬ 
ters ^ of the fitted GP distribution functions for r, which 
means there is no qualitative difference in the tail behav¬ 
ior and therefore we only show the results of analyzing 
SQA as a simulation of a physical system throughout the 
whole paper. 


C. Mean-fleld quantum annealing 

MFA is a mean-fleld version of simulated quantum an¬ 
nealing. A spin i is represented by a classical magnet 
pointing in some direction 9^ in the XZ plane [10] . The 
classical time-dependent Hamiltonian is then given by: 

Sill ^ ^ 

(7) 

MFA is a Monte Carlo algorithm which generates states 
of this system at thermal equilibrium and the temper¬ 
ature is again kept constant. A new state is generated 
by proposing an update of spin i by choosing uniform 
randomly a new angle 9i S [0, 27r) and then accepting 
or rejecting this update according to the Metropolis up¬ 
date rule [28] . We use the same linear schedule for the 
constants A{t) and Bit) as in the case for SQA and all 
spins are initialized to be aligned in x-direction at t = 0. 
Our MFA implementation follows the description in Ref. 
[10], but differs by using 1024 element lookup tables with 
precomputed values of sin{9) and cos{6). 


II. PROBLEM INSTANGES 

The present analysis was carried out on Ising spin 
glasses with Chimera topology. The Chimera graph is 
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FIG. 7. Chimera graph. Shown is a 128 spin Chimera 
graph consisting of 4 x 4 unit cells each containing 8 spins. 
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FIG. 8. Choosing the optimal annealing time per run 
for SQA at /3 = 10 and problems with N = 200 spins. The 
single-run success probability s is estimated for 1000 random 
instances for SQA with different annealing times ta by making 
12800 repetitions for each instance and ta- Then the mean 
number of repetitions t — 1/s required to find the ground 
state is calculated and the total effort taT to find the ground 
state is plotted (with 95% confidence intervals). The optimal 
annealing time is the ta which minimizes taT for most 
quantiles. For this case t//"^ = 150 sweeps. Note that the 
minimum of each quantile is at the same ta and very flat. 


formed by a two-dimensional lattice built from eight-spin 
unit cells that each form a complete bipartite graph 7 ^ 44 . 
Fig. 7 shows the chimera graph for 128 spins, correspond¬ 
ing to 4 X 4 unit cells. We used graphs with L x L unit 
cells and 8L^ spins, with L ranging from 2 to 8. The 
couplings between the spins were randomly chosen to be 
± 1 . 

To find the ground states of these Ising spin glasses, 
we implemented an exact solver using dynamic program¬ 
ming. The solver exploits the limited tree-width of the 
Chimera graph which allows one to perform an exhaus¬ 
tive search up to 512 spins. 

Note that since Chimera graphs are bipartite, all our 
Monte Carlo annealing algorithms could potentially be 
parallelized efficiently because spins in one sublattice 
couple only to spins in the other sublattice and hence all 
spins in one sublattice can be updated simultaneously. 


III. OPTIMAL NUMBER OF SWEEPS 

When investigating the scaling of annealers with prob¬ 
lem size N one has to carefully choose optimal param¬ 
eters or one might draw wrong conclusions [6, 9]. One 
could, for instance, choose to run the annealer N times 
and each time anneal for time ta = t and then choose the 
lowest lying state found, or, as an extreme case example, 
one could choose to make a single annealing run over a 
time span ta = N - t. The result in the two cases are very 
different, even though the total annealing time is equal 
in both cases. 

To make extrapolations to asymptotic scaling one 


N 

tT of SA 

of SQA 

of MFA 


/3 = 10 

II 

0 

t-H 

II 

II 

32 

10 

20 

20 

20 

50 

72 

25 

50 

50 

100 

100 

128 

70 

100 

100 

500 

200 

200 

200 

150 

150 

2000 

1000 

288 

300 

200 

400 

5000 

2000 

392 

650 

400 

1000 


5000 

512 

1500 

400 

1000 


5000 


TABLE I. Optimal annealing time t//'^ (given in numbers of 
sweeps) for SA, SQA or MFA to solve Ising spin glass prob¬ 
lems on Chimera graphs with ±1 couplings and N spins. For 
SQA and MFA we estimated the optimal annealing time for 
different inverse temperatures /3. 


needs to use an optimal annealing time which is 
defined to be the number of sweeps ta which minimizes 
the total effort taT, where r is the mean number of repeti¬ 
tions required to find the ground state. See Fig. 8. Note 
that the minima for the different quantiles are usually at 
roughly the same ta and the minima are quite broad and 
flat. Thus does not have to be determined with very 
high accuracy. The optimal annealing times t°P^{N) are 
listed in Tab. I for SA, SQA and MFA. 
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bu and a„, a continuous limiting distribution function as 
u goes to the right endpoint oj{F) of F, which is also 
called the extreme value condition [22], then: 




—^ 0 , 


for u w(F), X > u. 


with threshold/location parameter u, shape parameter 
scale parameter (t^ > 0 and {x) is a generalized 

Pareto (GP) distribution function 


Wi,u,aAx) := 1 - 



X — u 

^ U 



(9) 


FIG. 9. Determining the optimal inverse temperature /3 for 
MFA. Plotted are the quantiles of the total effort, re¬ 

quired to find the ground state (with 95% confidence inter¬ 
vals) for different values of /3. At each /? the optimal annealing 
time is calculated and used in this plot. The optimal in¬ 
verse temperature is (? = 4. 

IV. OPTIMAL TEMPERATURE OF MFA 

For the MFA algorithm we determined that the opti¬ 
mal inverse temperature is /? = 4 which minimizes the 
total effort required to find the ground state, taT, for 
most quantiles and problem sizes N. See Fig. 9 which 
shows the case for N = 200. 


V. EXTREME VALUE THEORY 


which is defined on {x : x > u and (1 -|- ^(x — u)/au) > 0} 
[22-24]. For ^ = 0 one has to take the limit ^ 0: 

= 1 - exp (-(x - , x>u. (10) 

For a large enough threshold u this theorem provides a 
parametric model for the exceedance distribution func¬ 
tion F[“1 (x) « The parameters ^ and au 

can be estimated by fitting the exceedances to 
as explained later. If (x) « (x) is a good ap¬ 

proximation, then it follows from the definition of (x) 
that [22] 

F{x) ~ FF{,ii„,a'„(a;) for x > u with (11) 

= cTu! {F{u))) and (12) 


In this section we summarize the important parts of 
Extreme Value Theory (EVT) which are used in the 
main paper. A detailed discussion can be found in the 
cited papers and textbooks. Here we will state the 
Balkema-de Haan-Pickands (BdHP) theorem, which says 
that if F{x) satisfies the extreme value condition, then 
F{x) « for X > u. We will then discuss why 

our F{x) doesn’t satisfy the extreme value condition, but 
show that nevertheless our empirical data indicates that 
F{x) « ^(,ii^,a^(x) for X > rt is a good model for the 
range of x which we are interested. 

A. Theory 

Let A be a random variable, i.e. the run-time of an 
instance using a specific algorithm, which is described by 
the unknown distribution function F(x) = Pr{A < x}. 

Then the exceedance distribution function over a 
threshold u, F[“1, is defined as 

F^"^^ (x) := Pr {X < x | A > u} = for x > u. 

1 — F(u) 

( 8 ) 

The Balkema-de Haan-Pickands theorem[20, 21] now says 
that if fN {bu + a„ x) has, for some choice of constants 


Uu = u- ci„VF^ Q 1 (T’(m)) . (13) 

The shape parameters ^ of the GP distribution func¬ 
tion which approximates F[“1 (x) or F(x) are identical. 
And in contrast to the scale parameter tT„ which is de¬ 
pendent on the chosen threshold u, the shape parame¬ 
ter is independent of u, i.e. suppose that the threshold 
u is high enough such that F(“1 (x) « then 

if exceedances over a higher threshold ji > u are fit¬ 
ted, it results in the same shape parameter ^ because 
of the peak-over-threshold stability of GP distribution 
functions : ^ with cr-f ^(/x-n) > 0 

[ 22 ]. 

In practice GP distribution functions are used as the 
canonical distribution function for modeling exceedances 
over a high threshold, see for example [25]. Moreover, 
as the distribution function of interest F{x) is unknown, 
it is commonly assumed as a null-hypothesis that F(x) 
satisfies the extreme value condition and hence it can 
be approximated by a GP distribution function above 
a high threshold. Tests on the data and fitted model 
should be performed to potentially reject the null- 
hypothesis, which we introduce in the next subsection. 
The estimated GP distribution functions are often used 
to make predictions of rare events, which are even more 
extreme than what has already been observed and used 
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FIG. 10. Dietrich, de Haan and Hiisler’s method to check the 
null-hypothesis [33] with r] — 2. Reject the null-hypothesis 
if the test statistic is larger than the 0.95 quantile of the 
limiting random variable. The full data set has 20’000 dif¬ 
ferent instances. The test statistic is calculated for different 
threshold values plotted according to the number of thresh¬ 
old exceedances k. The null-hypothesis has to be rejected for 
thresholds with more than 1480 exceedances. This means we 
have to choose a threshold high enough such that there are 
less than 1480 exceedances in this data set. Plotted is SA 
with N = 200 spins and optimal number of sweeps t°^^. 


FIG. 11. Fitted shape parameters ^ with standard errors for 
problems with different number of spins N depending on the 
threshold u. This plot is used to determine a high enough 
threshold. ^ should be constant for thresholds higher than a 
high enough threshold. The threshold on the x-axis is given 
in percentiles or number of threshold exceedances k from the 
total of 20000 random instances. Drees [35, p.237[ mentions 
that it is a good idea to plot the estimated shape parameter 
as a function of log(A:). Plotted is the data for SA with the 
optimal number of sweeps ta^^{N). 


for the fitting of the GP distribution parameters. For 
example, extreme value analysis is used to plan the 
heights of sea dikes in the Netherlands such they fail 
no more than once in lO’OOO years on average, but the 
data for the model fitting consists of the sea level from 
the last 100 years only [32]. In our case the extreme 
value condition is not satisfied, because as already 
stated in the main paper, our distribution functions of 
T are discrete with only a finite number of points in 
its support, and therefore the extreme value condition 
doesn’t hold, see [26, p.217]. Nevertheless we assume as 
a null-hypothesis that F{x) « (a;) in the range 

of X above the threshold u which we are interested. We 
are not interested in the very largest x, e.g. the hardest 
0.999999 quantile of our instances, because our problem 
of finding the ground state energy is NP hard, so the 
hardest instances are commonly assumed to be out of 
reach for current algorithms. In the next subsection we 
will see that the null-hypothesis cannot be rejected for 
our range of data and that the obtained GP distribution 
functions show excellent model fits, which justifies the 
use of GP distribution functions to fit our data. 


B. Fitting Procedure 

Here we introduce tests which could reject the null- 
hypothesis that our unknown distribution function satis¬ 
fies the extreme value condition and show how to choose 
a high enough threshold u. 

A common statistical approach to check the null- 


hypothesis is to estimate the parameters of the GP dis¬ 
tribution function from the exceedances over a threshold 
of the data. Based on these estimators, tail probabilities 
and quantiles are deduced. Then graphical methods are 
used to analyze the goodness-of-fit using for example QQ 
and PP plots [22, p.l45]. These graphical methods are 
introduced in the next subsection and the results didn’t 
reject the null-hypothesis. In addition we used a method 
presented in [33] to test the null-hypothesis, that F sat¬ 
isfies the extreme value condition with some additional 
(second order) condition. A review is given in [22, p.l44- 
151] and in the paper by Hfisler and Li [34], where they 
also provide a R program code. This code is used in our 
paper to calculate the test statistics and the 0.95 quan¬ 
tiles of the limiting random variables by moment esti¬ 
mation for different number of threshold exceedances, an 
example is shown in Fig. 10. The test statistic depends on 
the chosen threshold value. We saw that if the threshold 
value was too low, this test rejects the null-hypothesis, 
which provides us a hint on the smallest possible thresh¬ 
old value. For all the chosen thresholds in this paper this 
test method doesn’t reject the null-hypothesis. 

There are automatic ways of choosing a high enough 
threshold [22, p.l37], however, we used two manual 
method which might achieve better fitting results: First 
a range of thresholds is chosen and the exceedances over 
these thresholds are used to estimate the parameters of 
the GP distribution functions by the Maximum Likeli¬ 
hood Estimation (MLE) method [24, 36] . We used MLE 
implementations for extreme value theory from the two 
libraries ismev and evir [37, 38] and 20’000 random prob¬ 
lems are used for every algorithm, schedule and system 
size. The estimated shape parameter ^ as a function 
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Model 



FIG. 12. The exceedance distribution of r over a high thresh¬ 
old for SA with 512 spins and optimal number of sweeps 
is fitted to a GP distribution function. The goodness of the 
model fit is shown here with (a) PP and (b) QQ plot. As 
the points are close to the unit diagonal it indicates that the 
estimated GP distribution function is an excellent fit for the 
data. 


of the threshold is plotted in Fig. 11. Because of the 
peak-over-threshold stability of the GP distribution func¬ 
tion, the estimated shape parameter should be constant 
for thresholds fji > u \i the threshold u is already high 
enough. Therefore, there are usually three regions in 
Fig. 11: For a very high threshold there are only a few 
exceedances and hence the shape parameter fluctuates 
strongly. For a high threshold the shape parameter is 
stable around the true value. For a too low threshold 
there are too many exceedances and the extreme value 
condition might not be satisfied, this can be detected if 
the shape parameter is not constant or by testing the 
null-hypothesis, see Fig. 10. Now we have a range of 
threshold which seem to be high enough. To choose the 
exact threshold we used the PP and QQ plots to check 
which threshold shows the best model fit. 


of independent observations 

2 ^( 1 ) < X(^2) < • • • < X(^n) 

the empirical distribution function F is defined by 
~ % 

Suppose F is an estimated distribution function of F, 
then a probability plot, also called PP plot, consists of 
the points 

and a quantile plot, also called QQ plot, consists of the 
points 



where F ^ is the estimated quantile function [24, p.36]. If 
F is a good estimate of the distribution function F, then 
the points in a QQ plot and in a PP plot should lie close 
to the unit diagonal. An example is shown in Fig. 12. 
QQ plots turned out to much more sensitive about the 
threshold choice than PP plots. All PP and QQ plots 
show that the estimated model agrees excellently with 
the data. 


VI. RUNNING MEAN 

In the main paper we show that the tail of the distri¬ 
bution of r for SQA is always heavy > 0) for problems 
with N >72 spins. Sometimes the estimated shape pa¬ 
rameter ^ is even > 1 indicating that the mean of the GP 
distribution function is infinite but, as there exists only 
a finite number of instances, the mean of the distribu¬ 
tion function of r is not divergent but dominated by the 
hardest instances. This can easily be seen by considering 
the running mean of r, i.e. the mean of r for the first n 
instances. See Fig. 13 which shows that the mean is not 
converging using 20000 random instances with 288 spins 
for SQA with /3 = 10 and optimal annealing time 
This is expected because the estimated shape parameter 
^ is 1.10 ±0.07. 


Evaluation of Model Fit 


The goodness of the model fit can be graphically 
checked with PP and QQ plots. Given an ordered sample 
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FIG. 13. Plotted is the mean of r of the Hrst n instances for 
SQA with /3 = 10 and optimal annealing time for prob¬ 
lems with 288 spins. We see that the mean is not converging 
using 20000 random instances and single instances dominate 
the mean, e.g. instance 16773 has r = 7333790 and therefore 
moves the mean significantly upwards. This behavior can be 
expected as the estimated shape parameter of the GP distri¬ 
bution function is ^ = 1.10 ± 0.07 which means the mean of 
the GP distribution function is diverging. 


VII. DATA 


Here we show the results of the tail analysis and repeat 
the necessary details from the main text in order to be 
self-contained in the presentation of the data. 


For every algorithm, schedule and system size N we 
used 20000 random instances to estimate for each in¬ 
stance the single-run success probability s to find the 
ground state by averaging over many repetitions. Then 
the mean number of repetitions t = 1/s required to find 
the ground state is calculated. The distribution func¬ 
tion F{t) is analyzed using extreme value theory: The 
exceedances over a high threshold u are fitted to a GP 
distribution function, such that In 

the tables below are the results for SA, SQA and MFA: 
N is the number of spins, u is the threshold, k are the 
number of threshold exceedances, i.e. the number of in¬ 
stances that have t > u from all the 20000 instances, 
^ is the estimated shape parameter and tT„ is the esti¬ 
mated scale parameter of the GP distribution function 
Both ^ and cr„ are shown with the standard 
error obtained from the maximum likelihood estimation 
method. 


N 

k 

U 



32 

200 

12.9 

-0.32 ±0.06 

3.61 ±0.33 

72 

1000 

21.3 

0.06 ±0.03 

7.79 ±0.36 

128 

1000 

37.0 

0.26 ± 0.04 

14.6 ±0.7 

200 

1000 

58.4 

0.35 ± 0.04 

29.6 ± 1.5 

288 

998 

165 

0.51 ±0.05 

102 ±6 

392 

998 

317 

0.52 ±0.05 

260 ± 14 

512 

400 

1208 

0.72 ±0.09 

1086 ± 103 


TABLE II. Parameters for SA with optimal annealing time 
^opt(jY) Note by comparing to Tab. Ill that the estimated 
shape parameter ^ is less dependent on the number of sweeps 
than in the case of SQA. 


N 

k 

u 



32 

194 

1.5 

-0.06 ±0.06 

0.199 ±0.019 

72 

599 

2.4 

0.31 ±0.05 

0.564 ±0.037 

128 

400 

5.8 

0.51 ±0.08 

2.29 ±0.20 

200 

1000 

8.6 

0.48 ±0.05 

5.24 ±0.29 

288 

999 

20.1 

0.59 ±0.05 

14.5 ±0.8 

392 

599 

79.0 

0.53 ±0.07 

74.0 ±5.5 

512 

399 

373 

0.86 ±0.10 

324 ± 32 

TABLE III. Parameters for SA with ta 

= 10000 sweeps. 


N 

k 

u 


^ u 

32 

200 

1.6 

0.63 ±0.12 

0.813 ±0.110 

72 

200 

26.0 

1.27 ±0.16 

23.0 ± 3.4 

128 

400 

113 

1.71 ±0.14 

238 ± 28 

200 

400 

1381 

1.85 ±0.14 

2931 ± 354 

288 

318 

36281 

1.86 ±0.15 

88127±1485 


TABLE IV. Parameters for SQA with /3 = 10 and ta = 10000 
sweeps. Note that the ground state of the hardest instance 
with N = 288 spins was never found in 2’373’660’226 repeti¬ 
tions. We therefore carried out the analysis using an upper 
bound on the success probability of s = 4.21 x 10“^° for this 
instance. 
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N 

k 

U 


^ u 

32 

400 

7.9 

0.07 ±0.06 

2.10 ±0.16 

72 

200 

23.2 

0.31 ±0.09 

8.00 ±0.89 

128 

400 

47.3 

0.47 ±0.07 

26.1 ± 2.3 

200 

400 

219 

0.76 ±0.09 

170 ± 16 

288 

1000 

603 

1.10 ±0.07 

577 ± 37 


TABLE V. Parameters for SQA with /3 = 10 and optimal 
annealing time 


N 

k 

u 

e 


32 

400 

7.6 

-0.03 ±0.05 

1.96 ±0.14 

72 

198 

42.4 

0.17 ±0.08 

10.5 ± 1.1 

128 

398 

140 

0.36 ±0.07 

58.9 ±4.8 

200 

400 

226 

0.60 ±0.08 

139 ± 12 

288 

400 

998 

0.90 ±0.10 

812 ±80 

392 

198 

9302 

0.96 ±0.14 

10071±1779 


TABLE VII. Parameters for MFA with optimal /? = 4 and 
optimal annealing time ta^^{N). 


N 

k 

u 


u 

32 

400 

6.5 

-0.01 ±0.05 

1.71 ±0.13 

72 

400 

15.7 

0.19 ±0.06 

5.62 ±0.41 

128 

398 

48.3 

0.40 ±0.07 

22.1 ± 1.8 

200 

200 

318 

0.54 ± 0.11 

187 ± 23 

288 

400 

648 

0.75 ±0.09 

629 ± 59 


TABLE VI. Parameters for SQA with /3 = 4 and optimal 
annealing time 


N 

k 

u 



32 

298 

23.5 

-0.19 ±0.05 

6.92 ±0.53 

72 

398 

54.5 

0.06 ±0.06 

18.6 ± 1.4 

128 

100 

256 

0.58 ±0.15 

105 ± 18 

200 

398 

459 

0.86 ±0.09 

428 ± 42 

288 

399 

3419 

1.35 ±0.12 

4312 ±477 


TABLE VIII. Parameters for MFA with /3 = 10 and optimal 
annealing time 



